We are trying to determine which products have high association i.e. those that are likely to be bought together
We will imPlement this using Market Basket Analysis (MBA), which uses Association Rule Mining (Apriori algorithm)
\(M_1 \rightarrow M_2\) i.e. representation of having item \(M_2\) on the itemset which has \(M_1\) on it
Support -indicates how frequently the itemset occurs
Confidence - the number of times the rule is found to be true (i.e. likelines of occurrence of consequent given the the antecedent has occurred)
file_path <- "Import files/Market_Basket_Optimisation.csv"
data <- read_csv(file_path, col_names = FALSE)
Each row represents the transactions for individual customers
We convert the data frame to a sparse matrix (called transactions)
Sparse matrix is a matrix of 0s and 1s, with each row and column representing the various products
library(arules)
dataset <- read.transactions(file_path, sep = ",", rm.duplicates = TRUE )
## distribution of transactions with duplicates:
## 1
## 5
summary(dataset)
## transactions as itemMatrix in sparse format with
## 7501 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03288973
##
## most frequent items:
## mineral water eggs spaghetti french fries chocolate
## 1788 1348 1306 1282 1229
## (Other)
## 22405
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1754 1358 1044 816 667 493 391 324 259 139 102 67 40 22 17 4
## 18 19 20
## 1 2 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.914 5.000 20.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
library(RColorBrewer)
itemFrequencyPlot(dataset,
topN=20,
col=brewer.pal(8,'Pastel2'),
main='Relative Item Frequency Plot',
type="relative",
ylab="Item Frequency (Relative)")
The basic steps in implementing the Apriori algorithm are as follows:
Set up a minimum support and confidence
Take all the subsets in transactions having higher support than the minimum support
Take all the subsets in transactions having higher confidence than the minimum confidence
Sort the rules by decresing lift
The choice of support(how frequently the item appears in your data set) and confidence (frequency of the rule) varies by business case: depends on the goal, data size etc
For minimum support, we want products that are bought at least two times a day i.e. 2*7/len(dataset).
Minimum length is an specifies the minimum number of products you’d like to have in your rule (Not mandatory to include this)
Maximum length specifies the maximum number of products you’d like to have in your rule (Not mandatory to include this)
rules <- apriori(dataset,
parameter = list(support = 14/nrow(dataset),
confidence = .2,
minlen = 2,
maxlen = 20
)
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.001866418 2
## maxlen target ext
## 20 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 14
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [3193 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Removing Redundant Rules
#rules <- rules[!is.redundant(rules)]
subset.rules <- which(colSums(is.subset(rules, rules)) > 1) # get subset rules in vector
rules <- rules[-subset.rules] # remove subset rules.
rules_df <- DATAFRAME(rules) %>% arrange(desc(lift))
DT::datatable(rules_df)
library(arulesViz)
plot(rules,jitter = 0, engine = "plotly")
Rules with high confidence tend to have low support, and vice versa
Rules with high lift tend to have relatively low support
plot(rules, method = "two-key plot")
subrules <- head(sort(rules, by="lift"), n = 30, by = "lift")
#plot(subrules, method = "graph", engine = "htmlwidget")
plot(subrules, method = "graph",
control = list(
# edges = ggraph::geom_edge_link(
# end_cap = ggraph::circle(4, "mm"),
# start_cap = ggraph::circle(4, "mm"),
# color = "black",
# arrow = arrow(length = unit(2, "mm"), angle = 20, type = "closed"),
# alpha = .2
# ),
nodes = ggraph::geom_node_point(aes_string(size = "support", color = "lift"))
#nodetext = ggraph::geom_node_label(aes_string(label = "label"), alpha = .8, repel = TRUE)
)
) +
scale_color_gradient(low = "dodgerblue", high = "red") +
scale_size(range = c(2, 10))
# Shiny App for Interactive Manipulations and Visualization
#ruleExplorer(subrules, sidebarWidth = 2, graphHeight = '600px')